The Dataset I am using is obtained from kaggle.com, created from Human Development Reports.
Gross National Income Per Capita is a key economic indicator that measures the average income earned by individuals in a country. Disparities in GNI per capita among different countries can highlight variations in income distribution and economic development levels.
This dataset is widely used to compare and analyze the economic performance and prosperity of different nations over time.
GNI/gni short for Gross National Income
Original link: [https://www.kaggle.com/datasets/iamsouravbanerjee/gross-national-income-per-capita]
Usability:10.00
gni <- read.csv("Gross National Income Per Capita.csv")
# Disable the results are printing in scientific notation
options(scipen = 999)In the “Environment” pane, We could have a basic glance of the raw data. Noticed there are 195 observations and 39 variables.
suppressMessages(library(tidyverse))
suppressMessages(library(shiny))
suppressMessages(library(knitr))
suppressMessages(library(formattable))
suppressMessages(library(crayon))
suppressMessages(library(plotly))
suppressMessages(library(gridExtra))
suppressMessages(library(kableExtra))
suppressMessages(library(maps))
suppressMessages(library(DT))
suppressMessages(library(shinythemes))
suppressMessages(library(scales))The tidyverse is an collection of R packages designed for data science. The core tidyverse includes the packages like ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats.
# View the structure of the data frame
str(gni)
# Get summary statistics
summary(gni)
# View the first few rows
head(gni)By checking the data frame and summary, there are 6 character variables (also contain 3 categorical variables: Continent, Hemisphere, Human.Development.Groups, and UNDP.Developing.Regions). Others are all numerical variables (HDI.Rank..2021. refers to Human Development Index Rank for 2021, therefore it is an integer variable. Other 32 variables are specific GNI data of each country, rounded to four decimal places.)
Also, observed that a time frame from 1990 to 2021 contained in those numerical variable names. Exactly 32 years of data to be recorded.
By checking the first few rows, we could immediately see some data is
missing under the column of UNDP.Developing.Regions. Further
investigation based on the UNDP website (https://hdr.undp.org/data-center/documentation-and-downloads)
– and click on Developing regions. Now we are clear on those
obs./countries without a UNDP.Developing.Regions value
are all developed countries. Also, list all developing regions
and their abbreviations since we will use the data later:
- AS: Arab States
- EAP: East Asia and the Pacific
- ECA: Europe and Central Asia
- LAC: Latin America and the Caribbean
- SA: South Asia
- SSA: Sub-Saharan Africa
# check whether those obs./countries without a UNDP.Developing.Regions value are developed countries
selected_rows <- gni %>%
select(Country, HDI.Rank..2021., UNDP.Developing.Regions) %>%
filter(UNDP.Developing.Regions == "") %>%
arrange(HDI.Rank..2021.)
kable(head(selected_rows,5), row.names = FALSE, booktabs = TRUE )| Country | HDI.Rank..2021. | UNDP.Developing.Regions |
|---|---|---|
| Switzerland | 1 | |
| Norway | 2 | |
| Iceland | 3 | |
| Hong Kong | 4 | |
| Australia | 5 |
ggplot(data = gni, aes(x = Country, y = Gross.National.Income.Per.Capita..1990. )) +
geom_point() +
geom_smooth(method="lm", formula = y ~ x, se = FALSE)+
labs(title = "Scatterplot of Country and GNI Per Capita in 1990", x = "Country", y = "GNI Per Capita") +
theme(
axis.text.x = element_blank())# Create a line pot using AUS & BRA data
aus_values <-gni[9,8:ncol(gni)]
bra_values <-gni[24,8:ncol(gni)]
x_values <- colnames(gni)[8:ncol(gni)]
# Use regular expression to replace original colnames with only number
new_column_names <- sub('.*\\.(\\d+)\\..*', '\\1', x_values)
aus_bra_gni <- data.frame(x = new_column_names, y1 =as.numeric(aus_values), y2 =as.numeric(bra_values) )
ggplot(data = aus_bra_gni, aes(x, y, group = 1)) +
geom_line(aes(y=y1,color="Australia")) +
geom_line(aes(y=y2, color="Brazil")) +
geom_point(aes(y = y1), color = "blue") +
geom_point(aes(y = y2), color = "green") +
labs (title = "GNI Per Capita Trends in Australia and Brazil",x="Year",y="GNI Per Capita", color="Country") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) This line plot with two variables is highly readable than last plot, we could see a huge income gap between AUS and BRA. Also the trend of Australian economy development is continuously going up throughout the years while Brazil’s economy was relatively steady and had undergone some drawback after the year of 2013.
Although, we can not just analyze each country’s GNI trends(also skip col.2-5) but focus on some more general outputs. So next step, we will do data cleaning to deal with missing data and data transformation to categorize all the obs. according to the listed categorical variables.
Since we will do a lot of works on selecting each country’s GNI data, I create a function called country_desc, it takes two parameters, country name and certain data to be shown(rank, group and GNI data from 1990-2021)
country_desc <- function(dataset, country, description) {
selected_country <- dataset[gni$Country == country, ]
if ("rank" %in% description ) {
return (selected_country$HDI.Rank..2021.)
}
if ("group" %in% description ) {
return (selected_country$Human.Development.Groups)
}
if ("gniData" %in% description ) {
return (selected_country[, 8:39])
}
}
# check if it's working as we expect
country_desc(gni, "United States", "rank")## [1] 21
## [1] "Very High"
We will use country_desc(gni, “country_name”,“gniData”) a lot in later analysis.
Also, for better visualization, in above line plot code chunk, we replace original very long and repetitive colnames with only year number which makes the data frame and plots a lot more easier to read. Therefore create a function in case we will use it on copied gni data frame.
extract_year <- function(dataset, option) {
x_values <- colnames(dataset)[8:ncol(dataset)]
new_column_names <- sub('.*\\.(\\d+)\\..*', '\\1', x_values)
new_data <- dataset
colnames(new_data)[8:ncol(new_data)] <- new_column_names
if("colname" %in% option) {
return (new_column_names)
}
if ("new" %in% option) {
return (new_data)
}
}
extract_year(gni, "colname")## [1] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021"
Noted:
More functions will be listed in later code chuck. Like the multiple_line(country_names) I created is very powerful. It takes multiple country names(a vector) as parameter and return a line plot with the 1990-2021 GNI data of these countries. and hd_group_pie(continent_name) will create a pie chart presenting the HD group proportion in aimed continent. Also get_group_summary(group_column, summary_column) to check the total number of four different human development groups in each category.
# Select missing values ( both NA values and empty values are considered)
sum(is.na(gni))
# Using the function we created to extract only year, new as option means the output will be a new data frame instead of imputing colnames on our original dataset
colSums(is.na(extract_year(gni, "new")) | gni =="")
# Get the Country names containing missing values
# And we already aware col6 represent developing/developed country, so skip this col
deve_missing <- gni$Country[which(is.null(gni[, 5])| gni[, 5]=="")] # this col missing values are "" not NA
# Then create a function to list the country name with missing values
find_missing_countries <- function(col_index) {
missing_countries <- gni$Country[which(is.na(gni[, col_index]))]
return(missing_countries)
}
# Test
find_missing_countries(7)
print(gni$Country[which(is.na(gni[, 7]))])
# By checking the colSums and raw csv file, countries with missing values are the same unless data is not NA in the next year. So we only select col 8,13,23,26
# Create a missing value table
row1 <- paste(deve_missing, collapse = ", ")
row2 <- paste(find_missing_countries(7), collapse = ", ")
row3 <- paste(find_missing_countries(8), collapse = ", ")
row4 <- paste(find_missing_countries(13), collapse = ", ")
row5 <- paste(find_missing_countries(23), collapse = ", ")
row6 <- paste(find_missing_countries(26), collapse = ", ")
row_name_item <-c("Development", "2021rank","1990-1994","1995-2004","2005-2007","2008-2021")
country_missing_data <- data.frame(Countries = c(row1, row2, row3, row4, row5, row6), row.names = row_name_item)Perfect! Now with some formatting works, we have a decent table to consider about missing values.
formattable(country_missing_data, list(
Name = formatter("span", style = ~ style(font.weight = "bold"))
))| Countries | |
|---|---|
| Development | Monaco, Nauru, North Korea, Somalia |
| 2021rank | Monaco, Nauru, North Korea, Somalia |
| 1990-1994 | Bosnia and Herzegovina, Croatia, Monaco, North Macedonia, Montenegro, Nauru, North Korea, Serbia, South Sudan, Slovenia, Samoa |
| 1995-2004 | Monaco, Nauru, North Korea, South Sudan |
| 2005-2007 | Monaco, North Korea, South Sudan |
| 2008-2021 | Monaco, North Korea |
Based on some researches on authority websites, it’s clear that
Monaco, Nauru are microstates with very little population.
North Korea and Somalia do not regularly report data due to
conflict, lack of statistical capacity, and political reasons. And, some
countries do not have data in the period of 1990-1994 simply because
they did not exist (e.g. countries of the former Soviet Union, South
Sudan).
Reference: [https://datahelpdesk.worldbank.org/knowledgebase/articles/191133-why-are-some-data-not-available]
(https://datahelpdesk.worldbank.org/knowledgebase/articles/191133-why-are-some-data-not-available)
**In this case, there are not so many NA values and they are spreading inconsistently within different columns, so I won’t drop them or replace NA with other data for now.
Mainly, because data from those countries is not critical and won’t affect my further data visualization. Also, statistic data related to nation must be obtained from official sources therefore I can’t use mean, median or regression imputation to impute missing values.**
I will deal with these data case by case.
By using the function created earlier, we create a new data frame with clean and short colname (since we already know what each col represent for). Also round floats to integers.
gni_only_year <- extract_year(gni, "new")
gni_clean <- gni_only_year %>%
rename(Groups = Human.Development.Groups, DevelopingRegions = UNDP.Developing.Regions, HDRank = HDI.Rank..2021.) %>%
mutate(across(8:ncol(.), ~ round(., 0)))
head(gni_clean[, 1:8], n = 5)## ISO3 Country Continent Hemisphere Groups DevelopingRegions HDRank 1990
## 1 AFG Afghanistan Asia Northern Hemisphere Low SA 180 2685
## 2 AGO Angola Africa Southern Hemisphere Medium SSA 148 4846
## 3 ALB Albania Europe Northern Hemisphere High ECA 67 4742
## 4 AND Andorra Europe Northern Hemisphere Very High 40 43773
## 5 ARE United Arab Emirates Asia Northern Hemisphere Very High AS 26 102433
Col6 represent developing/developed country, therefore creating new data frames according to this information
# Create a developing country data frame using clean gni data
developing_gni <- gni_clean %>%
filter(DevelopingRegions != "")
# Similarly create a developed country data frame and delete DevelopingRegions column
developed_gni <- gni_clean %>%
filter(DevelopingRegions == "") %>%
select(-DevelopingRegions)
#check
#head(developed_gni[, 1:8], n = 5)
datatable(developed_gni[, 1:8])Sorting data based On Different Categorical Variables
# Col. 3-6 check the sum of each category, use these as parameters to Subgroup Data and try to get some useful thoughts
# check the total number of four different human development groups in each category
get_group_summary <- function(group_column, summary_column) {
group_summary <- gni %>%
filter(Human.Development.Groups != "") %>%
group_by({{ group_column }}, Human.Development.Groups) %>%
summarize(country_count = n(),
.groups = "drop")
return(group_summary)
}
# Let's check the sum of countries
developing_region_group <- get_group_summary(UNDP.Developing.Regions, Human.Development.Groups)%>%
filter(UNDP.Developing.Regions != "")
knitr::kable(get_group_summary(Continent, Human.Development.Groups), caption = "Count of HD Group by Continent") %>%
kable_material(c("striped", "hover")) %>%
scroll_box(width = "500px", height = "300px")| Continent | Human.Development.Groups | country_count |
|---|---|---|
| Africa | High | 7 |
| Africa | Low | 28 |
| Africa | Medium | 17 |
| Africa | Very High | 1 |
| America | High | 18 |
| America | Low | 1 |
| America | Medium | 7 |
| America | Very High | 9 |
| Asia | High | 14 |
| Asia | Low | 3 |
| Asia | Medium | 13 |
| Asia | Very High | 18 |
| Europe | High | 6 |
| Europe | Very High | 36 |
| Oceania | High | 4 |
| Oceania | Medium | 7 |
| Oceania | Very High | 2 |
Note that, from the developing_region_group table, some developing countries also fall into human development groups that are above average. And almost all the developed countries are in the group of Very High Human Development.
The Human Development Index (HDI) measures human development by average achievement in three basic dimensions — standard of living, education and health, which are assessed by three indicators — GNI per capita, expected years of schooling and life expectancy. https://hdr.undp.org/data-center/human-development-index#/indicies/HDI
We will discuss and visualize the dependencies between Gross per capits and HDI rank by development groups later with scatter plots.
Use Growth Rate & AVG values to check the GNI trends of each countries, and make comparisons
no_na_gni <- na.omit(gni)
growth_rate <- round((no_na_gni[, 39] - no_na_gni[, 8]) / no_na_gni[, 8] *100,0) ## drop na
mean = round(rowMeans(no_na_gni[8:39], na.rm = TRUE),0)
## create a new data frame and we will be using Growth Rate $ AVG as Indicator a lot later
growth_table_new <- data.frame(Country = no_na_gni$Country,
Continent = no_na_gni$Continent,
Latest = no_na_gni$Gross.National.Income.Per.Capita..2021.,
GrowthRate = growth_rate,
AVG = mean)
datatable(growth_table_new)continent_rank <- gni %>%
group_by(Continent) %>%
summarize(
min_rank = min(HDI.Rank..2021., na.rm = TRUE),
max_rank = max(HDI.Rank..2021., na.rm = TRUE),
average_gni = mean(Gross.National.Income.Per.Capita..2021., na.rm = TRUE),
.groups = "drop"
)
formattable(continent_rank, list(
Name = formatter("span", style = ~ style(font.weight = "bold"))
))| Continent | min_rank | max_rank | average_gni |
|---|---|---|---|
| Africa | 63 | 191 | 5338.534 |
| America | 15 | 163 | 16628.430 |
| Asia | 4 | 183 | 22112.316 |
| Europe | 1 | 80 | 42366.594 |
| Oceania | 5 | 156 | 12518.539 |
In this table, the average of GNI in each continent is listed indicating the differences of average income earned by individuals between these continents in 2021. Individuals in European countries clearly have the highest income level while Africa only has one eighth of that level. And Asian countries rank the middle surpassing America and Oceania. We could get some universally known facts based on above tables. Moving on to the next step, we will focus on GNI values.
Since we already took average GNI into consideration, and it seems a better indicator for data visualization. I will create some copies of GNI data with a new column of average GNI of each countries and consider whether NA should be dropped or colnames should be extracted.
# calculate the average GNI from 1990-2021 of each country
with_avg <- gni %>%
mutate(`Average GNI (1990-2021)` = rowMeans(.[8:39]))
# Sort the average in descending order for better visualization
sorted_avg <- with_avg[order(-with_avg$`Average GNI (1990-2021)`), ]
head(extract_year(sorted_avg, "new"))
# Drop na & ""
no_na_gni <- na.omit(gni)continent_summary <- growth_table_new %>%
group_by(Continent) %>%
summarize(
avg_gni = round(mean(AVG),1),
min_gni = round(min(Latest),1),min_gni_country = Country[which.min(Latest)],
max_gni = round(max(Latest),1),max_gni_country = Country[which.max(Latest)],
`avg_rate%` = round(mean(GrowthRate),1),
`min_rate%` = min(GrowthRate), min_rate_country = Country[which.min(GrowthRate)],
`max_rate%` = max(GrowthRate), max_rate_country = Country[which.max(GrowthRate)]
)
formattable(continent_summary,list(
Name = formatter("span", style = ~ style(font.weight = "bold"))
))| Continent | avg_gni | min_gni | min_gni_country | max_gni | max_gni_country | avg_rate% | min_rate% | min_rate_country | max_rate% | max_rate_country |
|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 4653.2 | 731.8 | Burundi | 25830.6 | Seychelles | 57.2 | -36 | The Democratic Republic of the Congo | 239 | Equatorial Guinea |
| America | 14376.1 | 2847.5 | Haiti | 64765.2 | United States | 85.3 | -71 | Venezuela | 996 | Guyana |
| Asia | 19224.0 | 1314.3 | Yemen | 90918.6 | Singapore | 129.5 | -52 | Yemen | 1092 | China |
| Europe | 36870.6 | 13255.5 | Ukraine | 146829.7 | Liechtenstein | 70.9 | -21 | Ukraine | 226 | Ireland |
| Oceania | 10682.0 | 2481.5 | Solomon Islands | 49238.4 | Australia | 48.6 | 3 | Palau | 116 | Kiribati |
We will have a clear understanding of min and max values within each countries and this table helps to work with further data exploratory
## create a histogram based on above data transformation works
count_his <- ggplot(data = growth_table_new, aes(x=Latest)) +
geom_histogram(binwidth = 5000,color="white", fill="lightblue") +
geom_vline(aes(xintercept=mean(Latest)),
color="blue", linetype="dashed", size=1) +
stat_bin(aes(y=..count.., label=..count..), geom="text", vjust=-.5) +
labs(
title = "Histogram of 2021 GNI Per Capita Data",
x = "GNI Range",
y = "Count of Countries"
)
count_his## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most countries’ GNI per capita is lower than 25000, and the blue line represents the mean level. The range of 10000-15000 has the most countries.
Few countries have GNI values over 70000, and an outliner shows up with income level way higher than other high income countries.
# create a Choropleth Maps using plotly
map_avg <- plot_ly(with_avg,
locations = ~Country,
locationmode = 'country names',
z = ~`Average GNI (1990-2021)`,
type = 'choropleth',
colors = colorRamp(c("#4786EC", "yellow", "red")),
text = ~paste("Country: ", Country, "<br> Average GNI: ", `Average GNI (1990-2021)`),
hoverinfo = "text") %>%
layout(title = "Hover or zoom in to check country details!",
geo = list(showframe = FALSE, showcoastlines = FALSE, projection = list(type = 'mercator'))) %>%
colorbar(title = "Average GNI per Capita")
map_avgThis dynamic map vividly depicts the average GNI values of each country using all the data we have in the dataset. Compared to our first trial, the data of all observed objects can be found intuitively. Very useful when we are trying to understand a country’s income level and simply compare the data between countries.
NA values are ignored, and from the map we further understand these countries are not critical at all, we could hardly find the location of Nauru on a global scale. Also countries with highest AVG GNI also not easy to see in the map(shown in red), because they are too small.
# Create a bar chart on above sorted data
sorted_bar <- plot_ly(sorted_avg,
x = ~reorder(Country, -`Average GNI (1990-2021)`),
y = ~`Average GNI (1990-2021)`,
type = 'bar',
text = ~paste("Country: ", Country, "<br> Average GNI: ", `Average GNI (1990-2021)`),
hoverinfo = "text",
marker = list( opacity = 0.65, line = list(color = 'F3B605', width = 6))) %>%
layout(title = ' Average GNI (1990-2021) From Highest Country To Lowest Country ',
xaxis = list(title = 'Country',tickangle = -90), yaxis = list(title = 'Average GNI'),
height = 500, width = 1000 )
sorted_barThis bar chart depicts countries that rank from the highest GNI to the lowest. Because of width setting, just the first of four countries in descending order will be shown in x-axis.
Also can be seen, there are countries with extreme values, so next step we will break down all countries to continents and discuss outliers, mean, and quartiles attributes.
# Create a box plot for each continent
no_na_avg <- with_avg[!is.na(with_avg$`Average GNI (1990-2021)`), ] ## drop na
continent_combined <- ggplot(no_na_avg, aes(x = Continent, y = `Average GNI (1990-2021)`, fill = Continent)) +
geom_boxplot(position = position_dodge(width = 0.8)) +
stat_summary(fun = "mean", geom = "point", shape = 8,
size = 2, color = "white")+
labs(title = 'Box Plot of Average GNI by Continent ( *is average value)',
x = 'Continent',
y = 'Average GNI') +
theme_minimal()
continent_combinedWe could directly see the median value and data distribution of each continent and easily know the income gap between continents.
Every continent has outliners, especially the one in Europe.
Values in Asia and Oceania are distributed highly uneven, which means there are more countries with below-average income level in Asia and Oceania. On the contrary, income levels in American, European and African countries are relatively evenly distributed when we ignore extreme values.
Interestingly, Asia and Europe have a much wider range of income levels than other continents.
#Use the function I created before get_group_summary <- function(group_column, summary_column)
hd_group <- get_group_summary(Continent, Human.Development.Groups)
#Create a new function to present the HD group proportion in aimed continent
hd_group_pie <- function(continent_name) {
data_subset <- hd_group %>%
filter(Continent == continent_name)
pie_chart <- ggplot(data = data_subset, aes(x = "", y = country_count, fill = Human.Development.Groups)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = paste("HD Group Proportion in", continent_name),
fill = "HD Group") +
theme_void()
return(pie_chart)
}
# Present five continents
continent_names <- c("Asia", "Africa", "Europe", "America", "Oceania")
pie_charts <- list()
# Create pie chart for each continent and use for to add them to the list
for (continent in continent_names) {
pie_charts[[continent]] <- hd_group_pie(continent)
}
grid.arrange(grobs = pie_charts, ncol = 2) Combined with what we know from the last box plot, GNI values in Asian countries are distributed highly uneven, so we will take a close look at Asian HDI VS GNI.
asia <- gni_clean %>%
filter(Continent == "Asia")
ggplot(data = asia, aes(x = HDRank, y = `2021`, color= Groups)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "HDI Rank (2021)", y = "Gross National Income Per Capita (2021)", color = "Human Development Groups") +
ggtitle("Gross National Income Per Capita (2021) and HDI Rank (2021) by Development Group") +
theme_minimal() +
theme(legend.position = "top") +
coord_cartesian(clip = "off") ## `geom_smooth()` using formula = 'y ~ x'
Clearly, in the very high development group of countries spreadness of Gross National income per capita is much higher. As can be seen at the pink scatters and its regression line.
GNI per capita and human development index do have a strong correlation
In this plot we will see countries that fall into the median range of the average gni in each countries.
# Group by Continent and calculate median Average GNI
continent_median <- no_na_avg %>%
group_by(Continent) %>%
summarize(median_GNI = median(`Average GNI (1990-2021)`))
continent_median## # A tibble: 5 × 2
## Continent median_GNI
## <chr> <dbl>
## 1 Africa 2829.
## 2 America 11479.
## 3 Asia 8208.
## 4 Europe 37081.
## 5 Oceania 4402.
Then manually choose five countries in each continent have a GNI around the median values as our objects
# Using function country_desc(dataset, country, description)) to select gniData of Zimbabwe, Tonga, Georgia, Brazil, Greece
# We could also turn this into a function if we want to compare multiple countries' data at the same time
multiple_line <- function(country_names) {
y_values <- lapply(country_names, function(country) {
as.numeric(country_desc(gni, country, "gniData"))
})
df <- data.frame(x = extract_year(gni, "colname"),
country = rep(country_names, each = length(y_values[[1]])),
y = unlist(y_values))
ggplot(df, aes(x, y, color = country, group = country)) +
geom_line() +
labs(title = "GNI Per Capita Trends in Selected Countries",
x = "Year", y = "GNI Per Capita", color = "Country") +
scale_color_discrete(guide = "legend") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
# Example usage
multiple_line(c("Zimbabwe", "Tonga", "Georgia", "Brazil", "Greece"))# Similarly, we create a line plot of countries with max GNI value but not Extreme Value to compare between continent
# Singapore, Switzerland, Canada, New Zealand, South Africa
multiple_line(c("Singapore", "Switzerland", "Canada", "New Zealand", "South Africa"))As can be seen from the plot:
Now, the growth rate comes up to be our next indicator.
options(scipen = 999)
growth_rate <- round((no_na_gni[, 39] - no_na_gni[, 8]) / no_na_gni[, 8] *100,0)
growth_table <- data.frame(Country = no_na_gni$Country, Continent = no_na_gni$Continent, GrowthRate = growth_rate, Latest = no_na_gni$Gross.National.Income.Per.Capita..2021.)
summary(growth_rate) # check summary of growthRate to set xlim y lim## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -71.00 18.00 54.00 83.66 104.50 1092.00
ggplot(growth_table, aes(x=Latest, y=GrowthRate)) +
geom_point(aes(col=Continent, size = 1.5)) +
geom_smooth(method = "lm", se = FALSE) +
geom_density_2d_filled(alpha = 0.5) + # check dispersion
xlim(c(700, 67000)) + # Drop some outliers
ylim(c(-50, 200)) +
labs(title="Latest GNI Data Vs Growth Rate(1990-2021)",
x="Gross.National.Income.Per.Capita..2021",
y="Growth Rate %") ## `geom_smooth()` using formula = 'y ~ x'
As for continent
# select countries with highest Growth rate
low_gni <- growth_table[growth_table$GrowthRate > 100 , ]
plot_ly(
low_gni, type = "scatter", mode = "markers", x=~Latest, y=~GrowthRate,
color=~Continent, size=~GrowthRate,
text = ~paste(Country, "<br>Grow Rate:", GrowthRate, "%<br>GNI:", Latest)
) %>%
layout(
title = "Countries With GNI Growth Rate Higher Than 100% <br>Hover to see country details!!!",
titlefont = list(size = 20),
xaxis = list(title = "Gross.National.Income.Per.Capita..2021"),
yaxis = list(title = "Growth Rate %", range = c(100, 1200)),
margin = list(t = 100)
)Most of countries with growth rate more than 200% are in Asia, and South Korea is the only two countries in the world also has GNI more than 40,000 (the other is Ireland).
These two scatter plots could check the distribution of countries in terms of the two dimensions of 2021 gni values and growth rates. They are both very important indicators to understand economic development level and sustainability, also helpful if we want to further forecast certain country’s income level.
# Load functions created in Rmd file and will be use in shiny
extract_year <- function(dataset, option) {
x_values <- colnames(dataset)[8:ncol(dataset)]
new_column_names <- sub('.*\\.(\\d+)\\..*', '\\1', x_values)
new_data <- dataset
colnames(new_data)[8:ncol(new_data)] <- new_column_names
if("colname" %in% option) {
return (new_column_names)
}
if ("new" %in% option) {
return (new_data)
}
}
country_gni <- function(country, description) {
selected_country <- gni[gni$Country == country, ]
if ("rank" %in% description ) {
print(selected_country$HDI.Rank..2021.)
}
if ("group" %in% description ) {
print(selected_country$Human.Development.Groups)
}
if ("gniData" %in% description ) {
print(selected_country[, 8:39])
}
}
multiple_line <- function(country_names) {
y_values <- lapply(country_names, function(country) {
as.numeric(country_desc(gni, country, "gniData"))
})
df <- data.frame(x = extract_year(gni, "colname"),
country = rep(country_names, each = length(y_values[[1]])),
y = unlist(y_values))
ggplot(df, aes(x, y, color = country, group = country)) +
geom_line() +
labs(title = "GNI Per Capita Trends in Selected Countries",
x = "Year", y = "GNI Per Capita", color = "Country") +
scale_color_discrete(guide = "legend") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
get_specific_summary <- function(group_column, specific) {
group_summary <- no_na_gni %>%
filter({{ group_column }} == {{ specific }} ) %>%
group_by({{ group_column }}, Human.Development.Groups) %>%
summarize(country_count = n(),
.groups = "drop")
return(group_summary)
}
no_na_gni <- na.omit(gni)
# Call the extract_year function to extract year
year <- extract_year(gni, "colname")
shinyApp(
## Define UI
ui <- fluidPage(theme = shinytheme("cosmo"),
# In the navbar, we will have FIVE tabs to present different plots with unique functions
navbarPage(
"Gross National Income Per Capita",
# Tab 1: Line & Point - Country
tabPanel("Line&Point-Country",
sidebarPanel(
selectInput(inputId = "country_input1",
label = "Choose Country 1:",
choices = unique(gni$Country)),
selectInput(inputId = "country_input2",
label = "Choose Country 2:",
choices = unique(gni$Country)),br(),
selectInput(inputId = "country_input3",
label = "Country Data Details:",
choices = unique(gni$Country)),
actionButton("update_button", "Update Plot"),
width = 3
),
mainPanel(
fluidRow(
plotOutput("plot1", height = "400px", width = "900px"),
tags$div(style = "height: 10px;"),
plotlyOutput("plot2", height = "400px", width = "800px")
)
)
),
# Tab 2: Pie - HD group
tabPanel("Pie-HD group",
sidebarPanel(
selectInput(inputId = "input_continent",
label = "Choose Continent:",
choices = unique(gni$Continent)),
selectInput(inputId = "input_region",
label = "Choose Region:",
choices = unique(no_na_gni$UNDP.Developing.Regions)),
helpText(
tags$p(
style = "font-size: 12px;",br(),
"# The Human Development Index (HDI)" ,br(),"measures human development by average achievement in three basic dimensions: standard of living, education and health,
which are assessed by three indicators: GNI per capita, expected years of schooling and life expectancy.",br(),br(),
"# UNDP.Developing.Regions." ,br(),"- AS: Arab States" ,br(),"
- EAP: East Asia and the Pacific" ,br(),"
- ECA: Europe and Central Asia" ,br(),"
- LAC: Latin America and the Caribbean" ,br(),"
- SA: South Asia" ,br(),"
- SSA: Sub-Saharan Africa "
)
),
width = 3
),
mainPanel(
fluidRow(
plotOutput("plot3", height = "400px", width = "900px"),
tags$div(style = "height: 20px;"),
plotOutput("plot4", height = "400px", width = "900px")
)
)
),
# Tab 3: Historgram - GNI
tabPanel("Historgram-2021GNI",
sidebarPanel(
sliderInput("bins", "Customize the partitions:", min = 0, max = 150000, value = 2500),
width = 2
),
mainPanel(
fluidRow(
plotOutput("plot5", height = "400px", width = "900px")
)
)
),
# Tab 4:Scatter - Growth Rate
tabPanel("Scatter-Growth Rate",
sidebarPanel(
width = 2,offset = 2,
sliderInput("rate", "Select Growth Rate Range (100%):", min = -100, max = 400, value = 100),
helpText(
tags$p(
style = "font-size: 19px;", br(), br(), br(), br(),br(),br(), br(), br(),br(),br(), br(), br(), br(),br(),
"# View dispersion of data points", br(), "by Continent or Global"
)
),
selectInput(inputId = "input_continent_scatter",
label = "Choose Continent:",
choices = c("Global", unique(gni$Continent)))
),
mainPanel(
width = 9,
plotlyOutput("plot6", height = "500px", width = "700px"),
tags$div(style = "height: 30px;"),
plotOutput("plot7", height = "600px", width = "700px")
)
),
# Tab 5:Box - Details
tabPanel("BOX-Continent Details",
sidebarPanel(
width = 2,offset = 2,
checkboxGroupInput("Details", "Select Where You Want To Know:",
choices = unique(gni$Continent))
),
mainPanel(
fluidRow(
plotOutput("plot8", height = "600px", width = "900px"),br(),br(),br(), br(), br(),
tableOutput("table")
)
))
)),
# Define the server
server <- function(input, output, session) {
options(warn = -1)
output$plot1 <- renderPlot({
validate(
need(input$update_button > 0, "Please select a country. Click 'Update Plot' to generate the plot")
)
countryone <- input$country_input1
countrytwo <- input$country_input2
countryone_gni <- country_gni(countryone, "gniData")
countrytwo_gni <- country_gni(countrytwo, "gniData")
two_gni <- data.frame(x = year, y1 =as.numeric(countryone_gni[1,]), y2 =as.numeric(countrytwo_gni[1,]) )
ggplot(data = two_gni, aes(x = year, y, group = 1)) +
geom_line(aes(y = y1, color = countryone)) +
geom_point(aes(y = y1, color = countryone)) +
geom_line(aes(y = y2, color = countrytwo)) +
geom_point(aes(y = y2, color = countrytwo)) +
labs(title =paste(countryone, "&", countrytwo, "GNI Per Capita Trend Over Years") ,
x = "Year", y = "Gross National Income", color = "Country") +
theme(plot.title = element_text(hjust = 0.5, face = "bold", size=16),
axis.text.x = element_text(angle = 45, hjust = 1))
})
## Another plot with interactive line and point plot to present selected country's GNI data at certain year
# Using Plotly
output$plot2 <- renderPlotly({
validate(
need(input$update_button > 0, "Please select a country. Click 'Update Plot' to generate the plot")
)
selected_country <- input$country_input3
country_gni_data <- country_gni(selected_country, "gniData")
country_data <- data.frame(x = year, y = as.numeric(country_gni_data[1,]))
p <- plot_ly(data = country_data, x = ~x, y = ~y, type = 'scatter', mode = 'lines+markers', name = selected_country) %>%
layout(title = paste(selected_country, "GNI Per Capita Over Years # Hover to check specific data!!!"),
xaxis = list(title = "Year", tickangle = -45),
yaxis = list(title = "Gross National Income"))
p
})
## tab 2 HD group
output$plot3 <- renderPlot({
hd_group <- data.frame(get_specific_summary(Continent, input$input_continent))
ggplot(data = hd_group, aes(x="", y = country_count, fill = Human.Development.Groups)) +
geom_bar(stat = "identity", width = 1) +
geom_text(aes(label = paste(country_count, " (", scales::percent(country_count / sum(country_count)), ")")),
position = position_stack(vjust = 0.5),
size = 5, family = "bold") +
coord_polar("y", start = 0) +
labs(title = paste("HD Group Proportions in", input$input_continent),
fill = "HD Group") +
theme(
plot.title = element_text(size = 16)
)
})
output$plot4 <- renderPlot({
hd_group <- data.frame(get_specific_summary(UNDP.Developing.Regions, input$input_region))
ggplot(data = hd_group, aes(x="", y = country_count, fill = Human.Development.Groups)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = paste(country_count, " (", scales::percent(country_count / sum(country_count)), ")")),
position = position_stack(vjust = 0.5),
size = 5, family = "bold") +
labs(title = paste("HD Group Proportions in", input$input_region, " (Only Count Developing Countries)"),
fill = "HD Group") +
theme(
plot.title = element_text(size = 16)
)
})
##tab 3 GNI&Growth Rate
output$plot5 <- renderPlot({
options(scipen = 999)
latest_table <- data.frame(Country = no_na_gni$Country,
Continent = no_na_gni$Continent,
Latest = no_na_gni$Gross.National.Income.Per.Capita..2021.)
ggplot(data = latest_table, aes(x = Latest, fill = Continent)) +
geom_histogram(binwidth = input$bins) +
geom_vline(aes(xintercept=mean(Latest)), color="darkblue", linetype="dashed", size=1)+
labs(
title = "Histogram of GNI Per Capita Data",
x = "GNI Range",
y = "Count of Countries"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.4, hjust = 0.5, face = "bold", size = 14),
axis.text.y = element_text(size = 16),
plot.title = element_text(size = 20))
}, height = 700, width = 800)
## tab4 scatter
output$plot6 <- renderPlotly({
growth_rate <- round((no_na_gni[, 39] - no_na_gni[, 8]) / no_na_gni[, 8] * 100, 0)
growth_scatter <- data.frame(Country = no_na_gni$Country,
Continent = no_na_gni$Continent,
GrowthRate = growth_rate,
Latest = no_na_gni$Gross.National.Income.Per.Capita..2021.)
# Set the y-axis range according to the range of the sliderInput
y_range <- case_when(
input$rate <= 0 ~ c(-100, 0), input$rate > -0 && input$rate<=100 ~ c(0, 100), input$rate > 100 && input$rate<=200 ~ c(100, 200),
input$rate > 200 && input$rate<=300 ~ c(200, 300), input$rate > 300 && input$rate<=400 ~ c(300, 400)
)
plot_ly(growth_scatter,
type = "scatter", mode = "markers", x = ~Latest, y = ~GrowthRate,
color = ~Continent, size = ~GrowthRate,
text = ~paste(Country, "<br>Grow Rate:", GrowthRate, "%<br>GNI:", Latest)
) %>%
layout(
title = list(text = "Latest GNI Per Capita Vs Growth Rate(1990-2021) <br>Hover to see country details!!!",
size = 20),
xaxis = list(title = "Gross.National.Income.Per.Capita..2021"),
yaxis = list(title = "Growth Rate %", range = y_range),
margin = list(t = 100),
height = 500, width = 700
)
})
output$plot7 <- renderPlot({
growth_rate <- round((no_na_gni[, 39] - no_na_gni[, 8]) / no_na_gni[, 8] *100,0)
growth_table <- data.frame(Country = no_na_gni$Country, Continent = no_na_gni$Continent, GrowthRate = growth_rate, Latest = no_na_gni$Gross.National.Income.Per.Capita..2021.)
selected_continent <- input$input_continent_scatter
aim <- growth_table %>% filter(Continent == selected_continent)
# set unique range of aimed continent
min_gni <- min(aim$Latest)
max_gni <-max(aim$Latest)
min_rate <-min(aim$GrowthRate)
max_rate <-max(aim$GrowthRate)
if (selected_continent == "Global") {
ggplot(growth_table, aes(x = Latest, y = GrowthRate)) +
geom_point(aes(col = Continent, size = 1.5)) +
geom_density_2d_filled(alpha = 0.5) +
xlim(c(700, 67000)) +
ylim(c(-50, 200)) +
labs(title = paste(selected_continent, " GNI Data Vs Growth Rate(1990-2021)"),
x = "Gross.National.Income.Per.Capita..2021",
y = "Growth Rate %") +
theme(
plot.title = element_text(size = 16),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14)
)
} else {
ggplot(aim,
aes(x = Latest, y = GrowthRate)) +
geom_point(aes(col = Continent, size = Latest),color="blue") +
geom_density_2d_filled(alpha = 0.5) +
xlim(c(min_gni, max_gni)) +
ylim(c(min_rate, max_rate)) +
labs(title = paste(selected_continent, " GNI Data Vs Growth Rate(1990-2021)"),
x = "Gross.National.Income.Per.Capita..2021",
y = "Growth Rate %") +
theme(
plot.title = element_text(size = 16),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14)
)
}
})
output$plot8 <- renderPlot({
selected_continents <- input$Details
filtered_data <- gni %>%
filter(Continent %in% selected_continents) %>%
mutate(`Average GNI (1990-2021)` = rowMeans(.[8:39]))
continent_combined <- ggplot(filtered_data, aes(x = Continent, y = `Average GNI (1990-2021)`, fill = Continent)) +
geom_boxplot(position = position_dodge(width = 0.8)) +
stat_summary(fun = "mean", geom = "point", shape = 8,
size = 3, color = "white")+
labs(title = 'Box Plot of Average GNI by Continent',
x = 'Continent',
y = 'Average GNI') +
theme_minimal()+
theme(
plot.title = element_text(size = 16),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14)
)
continent_combined
})
output$table <- renderTable({
growth_rate <- round((no_na_gni[, 39] - no_na_gni[, 8]) / no_na_gni[, 8] * 100, 0) ## drop na
mean = round(rowMeans(no_na_gni[8:39], na.rm = TRUE), 0)
## create a new data frame and we will be using Growth Rate $ AVG as Indicator a lot later
growth_table_new <- data.frame(
Country = no_na_gni$Country,
Continent = no_na_gni$Continent,
Latest = no_na_gni$Gross.National.Income.Per.Capita..2021.,
GrowthRate = growth_rate,
AVG = mean
)
selected_continent <- input$Details
continent_summary <- growth_table_new %>%
group_by(Continent) %>%
summarize(
avg_gni = round(mean(AVG), 1),
min_gni = round(min(Latest), 1),
min_gni_country = Country[which.min(Latest)],
max_gni = round(max(Latest), 1),
max_gni_country = Country[which.max(Latest)],
`avg_rate%` = round(mean(GrowthRate), 1),
`min_rate%` = min(GrowthRate),
min_rate_country = Country[which.min(GrowthRate)],
`max_rate%` = max(GrowthRate),
max_rate_country = Country[which.max(GrowthRate)]
) %>%
filter(Continent %in% selected_continent)
})
}
)